The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, both scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time), and the world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World in Data, and JHU's website was buggy: clicking on the U.S. stats didn't actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (Our World in Data also has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in things like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Mon Jun 15 20:53:50 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Mon Jun 15 20:53:50 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name: description
prediction_model: outputs a case estimate for given log-linear model parameters (slope and intercept)
make_long: converts input data to long format (specialized cases)
name_overlaps: outputs the column-name intersection and set differences of two data frames
find_linear_index: finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes the slope and intercept of a days vs. log10(cases) linear model, plus a set of days since the 100th case, and outputs a data frame with the total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # log-linear model: log10(total cases) = m*days + b
  total_cases.log<-m*days+b
  total_cases<-10^total_cases.log
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
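A quick usage sketch with hypothetical fit parameters (placeholders, not values fitted from the data):

```r
# slope/intercept as they would come from an lm(log10(cases) ~ days) fit; placeholders here
example_fit <- prediction_model(m = 0.05, b = 2, days = 0:10)
head(example_fit)
```

In practice, m and b come from the per-region linear model fits described in the analysis section.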

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value pair. Also enforces standardization (assumes data-structure naming) by using a fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                    value.name = "Total_confirmed_cases",
                    id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){
  long_data<-melt(data_in,
                  id.vars = id.vars,
                  variable.name=variable.name,
                  value.name=value.name)
  return(long_data)
}
##------------------------------------------
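A minimal wide-to-long illustration with toy values (the column names mimic the JHU layout; the two date columns here are hypothetical):

```r
toy_wide <- data.frame(case_type = "total", Province.State = "Pennsylvania",
                       Country.Region = "US_state", Lat = 40.3, Long = -75.1,
                       City = "Bucks", Population = NA,
                       X3.01.20 = 10, X3.02.20 = 14)
make_long(toy_wide)
# one row per date column, with Date / Total_confirmed_cases as the variable/value pair
```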

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
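For example, with two toy data frames (hypothetical columns, just to show the output shape):

```r
df_a <- data.frame(Date = 1, Total_confirmed_cases = 2, Lat = 0)
df_b <- data.frame(Date = 1, Total_confirmed_deaths = 3, Lat = 0)
ov <- name_overlaps(df_a, df_b)
# ov$int is c("Date", "Lat"); ov$sd_1_2 is "Total_confirmed_cases"
```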

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count, within a sliding window of length running_avg, how many second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
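A synthetic sanity check (toy numbers, not real case data): growth whose log2 rate first changes quickly and then holds steady should yield a stability date near the transition.

```r
# log2 daily growth decays steeply for 8 days, then holds constant (log-linear regime)
incr <- c(seq(3, 0.1, length.out = 8), rep(0.1, 32))
toy <- data.frame(Date = seq(as.Date("2020-03-01"), by = "day", length.out = 40),
                  Total_confirmed_cases_perstate = 100 * 2^cumsum(incr))
find_linear_index(toy)  # numeric date near where the growth rate stabilizes
```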

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.14.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.14.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.14.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.14.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other datasets. When combining, all datasets will have this column, but US deaths is the only one with real values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU uses an awkward file format: wide, with each column a date in MM/DD/YY. Read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # R prefixes numeric column names with "X"; strip it (replacing with "0" would break two-digit months)
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
(Full table omitted for brevity: nearly every country has 145 death and 145 total datapoints. Countries reported at the province level have multiples of 145: Australia 1160, Canada 2030, China 4785, Denmark 435, France 1595, Netherlands 725, United Kingdom 1595. US_state has 472845 of each.)
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only verified that case100 works for the U.S., so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
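The same days-since-100 logic on a toy frame (hypothetical numbers), to make the merge explicit:

```r
toy <- data.frame(Country.Region = "A", Date.numeric = 1:5,
                  Total_confirmed_cases = c(50, 90, 120, 300, 800))
first100 <- ddply(filter(toy, Total_confirmed_cases > 100),
                  c("Country.Region"), summarise, case100_date = min(Date.numeric))
toy <- merge(toy, first100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
# negative before day 3 (the first day with > 100 cases), zero on it, positive after
```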

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5553
Alaska 1108
Arizona 1361
Arkansas 5880
California 5187
Colorado 5027
Connecticut 808
Delaware 339
Diamond Princess 90
District of Columbia 91
Florida 5858
Georgia 13230
Grand Princess 91
Guam 91
Hawaii 467
Idaho 2689
Illinois 7763
Indiana 7665
Iowa 7280
Kansas 6149
Kentucky 8746
Louisiana 5515
Maine 1406
Maryland 2129
Massachusetts 1351
Michigan 6820
Minnesota 6478
Mississippi 6844
Missouri 7811
Montana 2406
Nebraska 4723
Nevada 1086
New Hampshire 961
New Jersey 2037
New Mexico 2388
New York 5193
North Carolina 8136
North Dakota 2927
Northern Mariana Islands 76
Ohio 7237
Oklahoma 5600
Oregon 2787
Pennsylvania 5673
Puerto Rico 91
Rhode Island 538
South Carolina 3973
South Dakota 3708
Tennessee 7853
Texas 16993
Utah 1240
Vermont 1290
Virgin Islands 91
Virginia 10322
Washington 3541
West Virginia 3861
Wisconsin 5570
Wyoming 1734
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs. time. For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs. total correlation

Planned plots (question: dataset, x, y, color, facet, pch, dimensions):

  • comparative longitudinal case trend (log): long data, x = time, y = log cases, color = geography, pch = case_type, dimensions [15, 50, 4], geography x case type (2 scales?)
  • comparative longitudinal case trend: long data, x = time, y = cases, color = geography, facet = case_type, dimensions [15, 50, 4], geography x case type (2+ scales)
  • comparative longitudinal mortality trend: wide data, x = time, y = mortality rate, color = geography, dimensions [15, 50, 4], geography
  • death vs. total correlation: wide data, x = cases, y = deaths, color = geography, dimensions [15, 50, 4], geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, log(cases) vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on same plot: they differ by ~2 logs, so when plotting log, use pch; when plotting absolute, need to use free scales
#when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 60097
China 145
Diamond Princess 126
Korea, South 116
Japan 115
Italy 113
Iran 110
Singapore 107
France 106
Germany 106
Spain 105
US 103
Switzerland 102
United Kingdom 102
Belgium 101
Netherlands 101
Norway 101
Sweden 101
Austria 99
Malaysia 98
Australia 97
Bahrain 97
Denmark 97
Canada 96
Qatar 96
Iceland 95
Brazil 94
Czechia 94
Finland 94
Greece 94
Iraq 94
Israel 94
Portugal 94
Slovenia 94
Egypt 93
Estonia 93
India 93
Ireland 93
Kuwait 93
Philippines 93
Poland 93
Romania 93
Saudi Arabia 93
Indonesia 92
Lebanon 92
Pakistan 92
San Marino 92
Thailand 92
Chile 91
Luxembourg 90
Peru 90
Russia 90
Ecuador 89
Mexico 89
Slovakia 89
South Africa 89
United Arab Emirates 89
Armenia 88
Colombia 88
Croatia 88
Panama 88
Serbia 88
Taiwan* 88
Turkey 88
Argentina 87
Bulgaria 87
Latvia 87
Uruguay 87
Algeria 86
Costa Rica 86
Dominican Republic 86
Hungary 86
Andorra 85
Bosnia and Herzegovina 85
Jordan 85
Lithuania 85
Morocco 85
New Zealand 85
North Macedonia 85
Vietnam 85
Albania 84
Cyprus 84
Malta 84
Moldova 84
Brunei 83
Burkina Faso 83
Sri Lanka 83
Tunisia 83
Ukraine 82
Azerbaijan 81
Ghana 81
Kazakhstan 81
Oman 81
Senegal 81
Venezuela 81
Afghanistan 80
Cote d’Ivoire 80
Cuba 79
Mauritius 79
Uzbekistan 79
Cambodia 78
Cameroon 78
Honduras 78
Nigeria 78
West Bank and Gaza 78
Belarus 77
Georgia 77
Bolivia 76
Kosovo 76
Kyrgyzstan 76
Montenegro 76
Congo (Kinshasa) 75
Kenya 74
Niger 73
Guinea 72
Rwanda 72
Trinidad and Tobago 72
Paraguay 71
Bangladesh 70
Djibouti 68
El Salvador 67
Guatemala 66
Madagascar 65
Mali 64
Congo (Brazzaville) 61
Jamaica 61
Gabon 59
Somalia 59
Tanzania 59
Ethiopia 58
Burma 57
Sudan 56
Liberia 55
Maldives 53
Equatorial Guinea 52
Cabo Verde 50
Sierra Leone 48
Guinea-Bissau 47
Togo 47
Zambia 46
Eswatini 45
Chad 44
Tajikistan 43
Haiti 41
Sao Tome and Principe 41
Benin 39
Nepal 39
Uganda 39
Central African Republic 38
South Sudan 38
Guyana 36
Mozambique 35
Yemen 31
Mongolia 30
Mauritania 27
Nicaragua 27
Malawi 21
Syria 21
Zimbabwe 19
Bahamas 18
Libya 18
Comoros 16
Suriname 8
Angola 5
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 145
Korea, South 116
Japan 115
Italy 113
Iran 110
Singapore 107
France 106
Germany 106
Spain 105
US 103
Switzerland 102
United Kingdom 102
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-02-16        18308
## 2    Afghanistan           <NA> <NA> 2020-02-08        18300
## 3    Afghanistan           <NA> <NA> 2020-02-23        18315
## 4    Afghanistan           <NA> <NA> 2020-02-27        18319
## 5    Afghanistan           <NA> <NA> 2020-02-18        18310
## 6    Afghanistan           <NA> <NA> 2020-02-19        18311
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     0            NaN
## 2                      0                     0            NaN
## 3                      0                     0            NaN
## 4                      0                     1              0
## 5                      0                     0            NaN
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                      -Inf                       -Inf        18348
## 2                      -Inf                       -Inf        18348
## 3                      -Inf                       -Inf        18348
## 4                         0                       -Inf        18348
## 5                      -Inf                       -Inf        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -40  NA   NA         NA                           NA
## 2            -48  NA   NA         NA                           NA
## 3            -33  NA   NA         NA                           NA
## 4            -29  NA   NA         NA                           NA
## 5            -38  NA   NA         NA                           NA
## 6            -37  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each state
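The function find_linear_index used below is defined upstream in this script. A minimal sketch of the idea, assuming the summary columns used here (Date, Total_confirmed_cases_perstate): return the earliest numeric date from which the discrete second derivative of the case curve stays near zero, i.e. where growth stops accelerating and the curve becomes roughly linear.

```r
# Sketch only: the real find_linear_index is defined upstream; the tolerance
# and the "stays near zero from here on" criterion are assumptions.
find_linear_index_sketch <- function(df, tol = 0.01) {
  df <- df[order(df$Date), ]
  cases <- df$Total_confirmed_cases_perstate
  d2 <- diff(cases, differences = 2)  # discrete 2nd derivative
  # earliest index from which the 2nd derivative stays within tolerance
  flat <- sapply(seq_along(d2), function(i) all(abs(d2[i:length(d2)]) < tol * max(cases)))
  idx <- which(flat)[1]
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx])
}
```

On a toy series that accelerates and then grows by a constant 100 cases/day, this picks out the date where the constant-growth stretch begins.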

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the top states: swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")
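The function prediction_model is defined upstream. Judging from how it is called here (its output is rbind-ed with columns days and Total_confirmed_cases_perstate), it presumably just evaluates the fitted line at the requested day numbers; a sketch under that assumption:

```r
# Assumed behavior of prediction_model: evaluate the fitted line m*days + b.
# m and b come from the per-state lm fits above; days are numeric dates.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```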

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted case load based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, given reporting delays?
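Some of these questions fall directly out of the fitted log10-linear parameters. With log10(cases) = m*t + b, the doubling time is log10(2)/m and the time to reach a target count N is (log10(N) - b)/m. The values of m and b below are illustrative placeholders, not results from this analysis:

```r
# Illustrative slope/intercept for a log10-linear fit (hypothetical values)
m <- 0.10  # log10 cases gained per day
b <- 2.0   # log10 cases at Days_since_100 == 0

# Cases double when log10(cases) increases by log10(2)
doubling_time <- log10(2) / m

# Days until a target case count N is reached
days_to_reach <- function(N) (log10(N) - b) / m
```

With these placeholder values, cases double roughly every 3 days and reach one million around day 40.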

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data to sanity-check per-group model fitting: two "states" with known fits
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)  # state A: y = x + 5; state B: y = 5x + 1
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  # Fit one lm per group, then collect the per-group coefficients
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # ldply names the coefficient columns "(Intercept)" and after the predictor
  # variable; rename them to b and m (plyr::rename takes c(old = new))
  predictor <- all.vars(formula_input)[2]
  case_models.parameters<-plyr::rename(case_models.parameters,
                                       setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

  * Recreation
  * Workplace
  * Residence
  * Park
  * Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9963383 -34.0
Hawaii Retail_Recreation 0.9766920 -56.0
New Hampshire Parks 0.9514852 -20.0
Vermont Parks 0.9334610 -35.5
Maine Transit -0.9160412 -50.0
Connecticut Grocery_Pharmacy -0.8884857 -6.0
Hawaii Parks 0.8831983 -72.0
Utah Residential -0.8812108 12.0
Hawaii Transit 0.8416579 -89.0
Utah Transit -0.8382427 -18.0
South Dakota Parks 0.8120645 -26.0
Arizona Grocery_Pharmacy -0.7790018 -15.0
Wyoming Parks -0.7731578 -4.0
Rhode Island Workplace -0.7648509 -39.5
Connecticut Transit -0.7586011 -50.0
Alaska Workplace -0.7547141 -33.0
Massachusetts Workplace -0.7501963 -39.0
Alaska Grocery_Pharmacy -0.7293560 -7.0
Arizona Residential 0.6723355 13.0
New York Workplace -0.6524524 -34.5
Vermont Grocery_Pharmacy -0.6523212 -25.0
Alaska Residential 0.6454484 13.0
Nevada Transit -0.6372285 -20.0
North Dakota Parks 0.6333846 -34.0
Utah Parks -0.6265565 17.0
Rhode Island Retail_Recreation -0.6264114 -45.0
New Jersey Parks -0.6043456 -6.0
Rhode Island Residential -0.6030669 18.5
Maine Workplace -0.5934180 -30.0
New York Retail_Recreation -0.5897424 -46.0
Nebraska Workplace 0.5826442 -32.0
Idaho Residential -0.5780046 11.0
Utah Workplace -0.5685366 -37.0
Arizona Retail_Recreation -0.5602471 -42.5
Hawaii Residential -0.5400111 19.0
New Jersey Workplace -0.5393479 -44.0
New York Parks 0.5297005 20.0
Connecticut Retail_Recreation -0.5238394 -45.0
Connecticut Residential 0.5171354 14.0
Massachusetts Retail_Recreation -0.4942678 -44.0
Kansas Parks 0.4910310 72.0
Maine Parks 0.4895937 -31.0
New Jersey Grocery_Pharmacy -0.4847056 2.5
New Mexico Grocery_Pharmacy -0.4820096 -11.0
Connecticut Workplace -0.4814503 -39.0
Montana Parks -0.4746946 -58.0
West Virginia Parks 0.4709439 -33.0
Nebraska Residential -0.4663861 14.0
Iowa Parks -0.4647028 28.5
New Mexico Residential 0.4534515 13.5
Rhode Island Parks 0.4526207 52.0
Maryland Workplace -0.4498206 -35.0
North Dakota Retail_Recreation -0.4495267 -42.0
Arizona Transit 0.4494467 -38.0
Arkansas Parks 0.4419897 -12.0
New Jersey Retail_Recreation -0.4321745 -62.5
Illinois Transit -0.4290277 -31.0
Pennsylvania Workplace -0.4202135 -36.0
Wyoming Transit -0.4146827 -17.0
South Carolina Workplace 0.4110464 -30.0
New Mexico Parks 0.4102019 -31.5
Maryland Grocery_Pharmacy -0.4093101 -10.0
New Jersey Transit -0.4087065 -50.5
Vermont Residential 0.4063606 11.5
Massachusetts Grocery_Pharmacy -0.4058063 -7.0
Montana Residential 0.4049397 14.0
New Hampshire Residential -0.3989852 14.0
Pennsylvania Retail_Recreation -0.3914364 -45.0
Michigan Parks 0.3907013 28.5
Kentucky Parks -0.3873871 28.5
Alabama Workplace -0.3870248 -29.0
Oregon Workplace -0.3863588 -31.0
Alabama Grocery_Pharmacy -0.3765508 -2.0
New York Transit -0.3676569 -48.0
New Mexico Retail_Recreation -0.3557197 -42.5
Oregon Parks -0.3510103 16.5
Wisconsin Transit -0.3464747 -23.5
Nebraska Grocery_Pharmacy 0.3414901 -0.5
Wyoming Workplace -0.3399230 -31.0
Idaho Workplace -0.3329882 -29.0
Idaho Grocery_Pharmacy -0.3323841 -5.5
Arkansas Retail_Recreation -0.3249775 -30.0
North Dakota Workplace 0.3247268 -40.0
Maryland Retail_Recreation -0.3225080 -39.0
Wyoming Grocery_Pharmacy -0.3202924 -10.0
California Transit -0.3177120 -42.0
California Residential 0.3160515 14.0
Virginia Transit -0.3150046 -33.0
Missouri Residential -0.3126935 13.0
Maine Retail_Recreation -0.3049277 -42.0
Alaska Retail_Recreation 0.2987923 -39.0
California Parks -0.2976316 -38.5
Idaho Transit -0.2947824 -30.0
Colorado Residential 0.2895601 14.0
Illinois Workplace -0.2895041 -30.5
Florida Residential 0.2888797 14.0
Minnesota Transit -0.2881512 -28.5
Nevada Residential 0.2849230 17.0
Montana Transit 0.2838082 -41.0
Pennsylvania Parks 0.2798228 12.0
Vermont Retail_Recreation 0.2796369 -57.0
Mississippi Residential 0.2772882 13.0
North Carolina Grocery_Pharmacy 0.2749748 0.0
South Carolina Parks -0.2728081 -23.0
Pennsylvania Grocery_Pharmacy -0.2647590 -6.0
North Carolina Workplace 0.2636928 -31.0
Rhode Island Grocery_Pharmacy 0.2634588 -7.5
Georgia Grocery_Pharmacy -0.2615915 -10.0
Vermont Workplace -0.2605063 -43.0
Kansas Workplace 0.2578891 -32.5
Maryland Residential 0.2574426 15.0
Alabama Transit -0.2566305 -36.5
Texas Workplace 0.2550444 -32.0
Rhode Island Transit -0.2528446 -56.0
Illinois Parks 0.2501446 26.5
Texas Residential -0.2497130 15.0
Tennessee Workplace -0.2482251 -31.0
Florida Parks -0.2395751 -43.0
Tennessee Residential 0.2350379 11.5
Hawaii Workplace 0.2347653 -46.0
West Virginia Grocery_Pharmacy -0.2343967 -6.0
Georgia Workplace -0.2311928 -33.5
New York Grocery_Pharmacy -0.2292318 8.0
Wisconsin Parks 0.2262361 51.5
Washington Grocery_Pharmacy 0.2233943 -7.0
Georgia Retail_Recreation -0.2197271 -41.0
South Dakota Workplace 0.2157919 -35.0
Minnesota Parks 0.2121428 -9.0
North Carolina Transit 0.2107094 -32.0
Connecticut Parks 0.2050403 43.0
Arizona Parks -0.2014660 -44.5
Nevada Retail_Recreation -0.1999971 -43.0
West Virginia Workplace 0.1987680 -33.0
North Dakota Grocery_Pharmacy -0.1985338 -8.0
Mississippi Grocery_Pharmacy -0.1974552 -8.0
Kansas Grocery_Pharmacy -0.1956145 -14.0
Missouri Workplace 0.1938239 -29.0
North Carolina Residential 0.1932392 13.0
Colorado Parks -0.1915265 2.0
Illinois Residential 0.1911106 14.0
Kentucky Transit -0.1896974 -31.0
Indiana Parks -0.1888326 29.0
Idaho Retail_Recreation -0.1878367 -39.5
Utah Retail_Recreation -0.1853690 -40.0
Nebraska Transit -0.1851402 -9.0
Virginia Residential 0.1849705 14.0
Oregon Transit 0.1843150 -27.5
Texas Parks 0.1773264 -42.0
Wisconsin Residential -0.1772745 14.0
Kentucky Grocery_Pharmacy 0.1764460 4.0
Oregon Residential 0.1737891 10.5
Nevada Workplace -0.1736066 -40.0
Pennsylvania Transit -0.1662193 -42.0
New Hampshire Retail_Recreation -0.1655429 -41.0
Tennessee Parks -0.1641634 10.5
Indiana Residential 0.1640844 12.0
Alabama Parks 0.1637611 -1.0
Montana Workplace -0.1627080 -40.0
Massachusetts Parks 0.1619486 39.0
New Jersey Residential 0.1603169 18.0
New Mexico Transit 0.1598454 -38.5
Ohio Transit 0.1585118 -28.0
South Dakota Residential 0.1554263 15.0
Iowa Transit 0.1492106 -24.0
Arkansas Residential 0.1430873 12.0
Michigan Workplace -0.1417351 -40.0
California Grocery_Pharmacy -0.1396868 -11.5
Virginia Grocery_Pharmacy -0.1394392 -8.0
Minnesota Retail_Recreation 0.1377797 -40.5
Indiana Retail_Recreation 0.1354930 -38.0
North Dakota Residential -0.1352837 17.0
Missouri Transit -0.1349824 -24.5
North Dakota Transit 0.1309745 -48.0
Missouri Parks 0.1280668 0.0
Mississippi Retail_Recreation -0.1273404 -40.0
Wisconsin Workplace -0.1238713 -31.0
Texas Grocery_Pharmacy 0.1217290 -14.0
California Retail_Recreation -0.1216541 -44.0
Mississippi Transit -0.1205313 -38.5
Wisconsin Grocery_Pharmacy 0.1182494 -1.0
North Carolina Retail_Recreation 0.1176920 -34.0
Florida Retail_Recreation 0.1145571 -43.0
Wyoming Residential 0.1142591 12.5
Kentucky Residential 0.1108641 12.0
Nebraska Retail_Recreation 0.1066044 -36.0
Arkansas Workplace -0.1058634 -26.0
Idaho Parks 0.1043225 -22.0
Michigan Residential 0.1042007 15.0
Mississippi Parks -0.1040065 -25.0
Oklahoma Grocery_Pharmacy -0.1035375 -0.5
Maryland Transit -0.1030099 -39.0
Virginia Parks 0.1026739 6.0
Massachusetts Transit -0.1025045 -45.0
Iowa Workplace 0.1021984 -30.0
New Hampshire Grocery_Pharmacy -0.1012904 -6.0
Kansas Transit -0.1010574 -26.5
Massachusetts Residential 0.1007376 15.0
Oklahoma Parks 0.1002056 -18.5
Utah Grocery_Pharmacy 0.0979261 -4.0
Michigan Transit 0.0975602 -46.0
Michigan Retail_Recreation -0.0968806 -53.0
Ohio Residential 0.0929236 14.0
Minnesota Grocery_Pharmacy 0.0919075 -6.0
New York Residential 0.0912006 17.5
Alabama Retail_Recreation 0.0907760 -39.0
Indiana Workplace 0.0891283 -34.0
Oregon Grocery_Pharmacy 0.0890818 -7.0
Nevada Parks 0.0887461 -12.5
Oklahoma Workplace 0.0876863 -31.0
Texas Transit 0.0874214 -41.0
Virginia Workplace -0.0868513 -31.5
South Dakota Transit -0.0866723 -40.0
Iowa Grocery_Pharmacy -0.0863169 4.0
West Virginia Residential -0.0855932 11.0
Georgia Residential -0.0854179 13.0
Alabama Residential -0.0845941 11.0
Ohio Parks -0.0837697 67.5
Georgia Parks 0.0806098 -6.0
Oregon Retail_Recreation -0.0790253 -40.5
Minnesota Residential -0.0789516 17.0
Pennsylvania Residential 0.0767490 15.0
Virginia Retail_Recreation -0.0762239 -35.0
Washington Workplace -0.0740569 -38.0
Ohio Grocery_Pharmacy 0.0727723 0.0
Kentucky Retail_Recreation 0.0720589 -29.0
West Virginia Transit -0.0717074 -45.0
North Carolina Parks -0.0680372 7.0
Colorado Transit 0.0647873 -36.0
Minnesota Workplace -0.0646315 -33.0
Maine Residential -0.0640628 11.0
Texas Retail_Recreation 0.0612312 -40.0
Washington Residential 0.0608472 13.0
Vermont Transit -0.0576777 -63.0
Washington Transit -0.0574485 -33.5
Florida Grocery_Pharmacy 0.0571168 -14.0
California Workplace -0.0561057 -36.0
Florida Transit -0.0552064 -49.0
Montana Retail_Recreation 0.0539574 -50.0
Kentucky Workplace -0.0488470 -36.5
Ohio Retail_Recreation 0.0487020 -36.0
South Dakota Retail_Recreation -0.0475723 -39.0
New Hampshire Workplace 0.0470651 -37.0
Washington Parks 0.0465631 -3.5
Indiana Transit 0.0453982 -29.0
Georgia Transit -0.0413530 -35.0
Missouri Retail_Recreation -0.0405321 -36.0
Ohio Workplace -0.0372655 -35.0
Illinois Grocery_Pharmacy -0.0362776 2.0
Iowa Retail_Recreation 0.0352999 -38.0
Mississippi Workplace -0.0327478 -33.0
Wisconsin Retail_Recreation 0.0306580 -44.0
Arizona Workplace -0.0294025 -35.0
New Mexico Workplace 0.0288480 -34.0
Arkansas Grocery_Pharmacy -0.0280118 3.0
South Carolina Grocery_Pharmacy 0.0273745 1.0
Tennessee Transit 0.0268989 -32.0
Colorado Grocery_Pharmacy -0.0264980 -17.0
Maine Grocery_Pharmacy -0.0264511 -13.0
West Virginia Retail_Recreation -0.0241022 -38.5
South Carolina Transit 0.0238646 -45.0
Michigan Grocery_Pharmacy -0.0226895 -11.0
Washington Retail_Recreation 0.0218516 -42.0
Indiana Grocery_Pharmacy -0.0218129 -5.5
Illinois Retail_Recreation 0.0206889 -40.0
Colorado Retail_Recreation -0.0203766 -44.0
Kansas Residential -0.0192692 13.0
Tennessee Grocery_Pharmacy 0.0185327 6.0
South Carolina Residential -0.0181932 12.0
New Hampshire Transit 0.0177586 -57.0
Tennessee Retail_Recreation -0.0168981 -30.0
Colorado Workplace 0.0165158 -39.0
Nebraska Parks 0.0160138 55.5
Oklahoma Residential 0.0145110 15.0
Iowa Residential -0.0143971 13.0
Maryland Parks 0.0139789 27.0
Oklahoma Transit 0.0131899 -26.0
Missouri Grocery_Pharmacy 0.0124944 2.0
Wyoming Retail_Recreation -0.0115305 -39.0
South Dakota Grocery_Pharmacy -0.0110236 -9.0
Arkansas Transit 0.0108231 -27.0
Florida Workplace -0.0090412 -33.0
Kansas Retail_Recreation -0.0071101 -38.0
Montana Grocery_Pharmacy -0.0069173 -16.0
South Carolina Retail_Recreation -0.0050199 -35.0
Oklahoma Retail_Recreation 0.0035061 -31.0
Nevada Grocery_Pharmacy 0.0016930 -12.5
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note that this is a manual process; the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
---- | ------ | ----
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, remain log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

Question | Answer
-------- | ------
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the better correlations, but the results are mixed.
What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are log-linear overall for most countries, with a tail-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to tail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates by region but is about 3% overall.
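The "log-linear phase" language above can be made concrete: during exponential growth, log10(cumulative cases) plotted against day is a straight line, so the R² of a simple linear fit indicates whether a region is still in that phase. A minimal base-R sketch with made-up illustrative counts (the real analysis works on the Corona_Cases data frames built earlier):

```r
# Hypothetical cumulative case counts over ten days (illustrative only)
cases <- c(10, 18, 33, 60, 110, 200, 360, 640, 1150, 2050)
day <- seq_along(cases)

# During exponential growth, log10(cases) ~ day is linear
fit <- lm(log10(cases) ~ day)
r2 <- summary(fit)$r.squared                     # near 1 => still log-linear
doubling_days <- log10(2) / coef(fit)[["day"]]   # implied doubling time, in days

r2
doubling_days
```

A flattening curve shows up as a falling R² over a recent window, which is one way to quantify the tail-off noted above for NY, NJ, and PA.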

END

End: ##—— Mon Jun 15 20:55:39 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if the maps package is not installed
library(maps)
library(dplyr)
library(ggplot2)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # dummy case counts for testing the fill aesthetic
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for each country's name
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)  # sf supplies sf-aware methods for the dplyr join verbs

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Long","Lat")), # coords are (x, y) = (longitude, latitude)
                               data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # st_cast() expects an sf object; convert the sp object first

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net